Abstract

Semistructured databases require tailor-made concurrency control mech- 
anisms since traditional solutions for the relational model have been shown 
to be inadequate. Such mechanisms need to take full advantage of the 
hierarchical structure of semistructured data, for instance allowing con- 
current updates of subtrees of, or even individual elements in, XML 
documents. We present an approach for concurrency control which is 
document-independent in the sense that two schedules of semistructured 
transactions are considered equivalent if they are equivalent on all possible
documents. We prove that it is decidable in polynomial time whether
two given schedules in this framework are equivalent. This also yields a
decision procedure for view serializability of semistructured schedules that is
polynomial in the size of the schedule and exponential in the number of
transactions.



1 Introduction 



In previous work |S1|HJ[7] we have shown that traditional concurrency control EH 
mechanisms for the relational model [21 ^2 1191 12U] are inadequate to capture 
the complicated update behavior that is possible for semistructured databases. 
Indeed, when XML documents are stored in relational databases, their hierar- 
chical structure becomes invisible to the locking strategy used by the database 
management system. 

*Roel Vercammen is supported by IWT - Institute for the Encouragement of Innovation
by Science and Technology Flanders, grant number 31581. 
In general, two actions on two different nodes of a document tree that are
completely 'independent' of each other cannot cause a conflict, even if they
are updates. For instance, changing the spelling of the name of one of the
authors of a book and adding a chapter to the book cannot cause a conflict.
Most classical concurrency control mechanisms, when applied in a naive way to
semistructured data, will not allow such concurrent updates. This consideration
is the main reason why the classical approaches seem to be inadequate as a
concurrency control mechanism for semistructured data.

Most of the work on concurrency control for XML and semistructured data 
is based on the observation that the data is usually accessed by means of XPath 
expressions. Therefore it is suggested in jS] to use a simplified form of XPath ex- 
pressions as locks on the document such that precisely all operations that change 
the result of the expression are no longer allowed. Two alternatives for conflict- 
checking are proposed, one where path locks are propagated down the XML 
tree and one where updates are propagated up the tree, which both have their 
specific benefits. This approach is extended in [7] where a commit-scheduler is 
defined and it is proved that the schedules it generates are serializable. Finally
in |H an alternative conflict-scheduler is introduced that allows more schedules 
than the previously introduced commit-scheduler. 

A similar approach is taken in 0| where conflicts with path locks are detected 
by accumulating updates in the XML tree and intelligently recomputing the 
results of the path expressions. As a result they can allow more complex path 
expressions, but conflict checking becomes more expensive. Another related 
approach is presented in where locks are derived from the path expressions 
and a protocol for these locks is introduced that guarantees serializability. 

Several locking protocols that are not based on path expressions but on DOM 
operations are introduced in |14II15| . Here, there are locks that lock the whole 
document, locks that lock all the children of a certain node and locks that lock 
individual nodes or pointers between them. An interesting new aspect here is
the possibility of using the DTD to reduce conflicts and thus allow more
parallelism. Although these locking protocols seem very suitable in the case of
DOM operations, it is not clear whether they will also perform well if most of 
the access is done by path expressions. A similar approach, but extended with 
the aspect of multi-granularity locking, is presented in |121 ll.ij . This approach 
seems more suitable for hierarchical data like semistructured data and XML. 
However, such mechanisms will often allow less concurrency than a path based 
locking protocol would. 

A potential problem with many of the previously mentioned protocols is that 
locks are associated with document nodes and so for large documents we may
have large numbers of locks. A possible solution for this is presented in [THj 
where the locks are associated with the nodes in a DataGuide, which is usually
much smaller than the document. However, this protocol does not guarantee 
serializability and allows phantoms. 

For all the approaches above it holds that the concurrency control mecha- 
nisms are somehow dependent upon the document. In most cases this means 
that if the document gets very large then the overhead may also become very 
large. This paper investigates the possibilities of a document-independent
concurrency control mechanism. It extends the preliminary results on this subject
that were presented in

The total behavior of the processes that we consider in this paper is straight- 
forward: each cooperating process produces a transaction of atomic actions that 
are queries or updates on the actual document. The transactions are interleaved
by the scheduler and the resulting schedule has to be equivalent with a serial
schedule. Two schedules on the same set of transactions are called equivalent iff
for each possible input document they represent the same transformation 
and each query gives the same result in both schedules. This is a special defini- 
tion of view equivalency, which we will use to decide view serializability 3, for 
a schedule. 

Note that we consider view serializability, as opposed to conflict
serializability. As we will show later on, conflict serializability, which might be
more interesting from a computational point of view, allows fewer schedules to be
serialized and hence can be too restrictive.

The updates that we consider are very primitive: the addition of an edge 
of the document tree and the deletion of an edge. Semantically the addition 
is only defined if the added edge does not already exist in the document tree. 
Analogously the deletion is only defined if the deleted edge exists. A more 
general semantics, that does not include this constraint, can be easily simulated 
by adding first some queries. 

There are some schedules for which the result is undefined for all document 
trees (e.g., a schedule consisting of two consecutive deletions of the same edge). 
These schedules are meaningless and are called inconsistent. Hence a schedule 
is consistent if there exists at least one document tree on which its application 
is defined. We prove that the consistency of schedules is polynomially decidable.

In order to tackle the equivalence of schedules and transactions we first
consider schedules without queries, so that we only have to focus on the
transformational behavior of the schedules. We will see that, contrary to the
relational model, swapping actions cannot help us in detecting the
equivalence of two schedules. We prove that the equivalence of queryless
schedules is also polynomially decidable, and that view serializability is
decidable in time exponential in the number of transactions and polynomial in
the number of operations. Finally we generalize the results above to general
schedules over the same set of transactions.

The paper contains a number of theoretical results on which the algorithms 
are based. The algorithms are a straightforward consequence of the given proofs 
or sketches. The complete proofs are given in |16j . 

The paper is structured as follows: Section 2 defines the data model, the
operations and the semistructured schedules. Section 3 studies the consistency
of schedules without queries. In Section 4 we study the equivalence and the view
serializability problem for these queryless schedules. In Section 5 we generalize
these results for consistent schedules.
2 Data Model and Operations

The data model we use is derived from the classical data model for semistruc- 
tured data p^. We consider directed, unordered trees in which the edges are 
labelled. 

Consider a fixed universal set of nodes $\mathcal{N}$ and a fixed universal set of edge
labels $\mathcal{L}$ not containing the symbol /.

Definition 1. A graph is a tuple $(N, E)$ with $N \subseteq \mathcal{N}$ and
$E \subseteq N \times \mathcal{L} \times N$.
A document tree (DT) $T$ is a tuple $(N, E, r)$ such that $(N, E)$ is a graph that
represents a tree with root $r$. The edges are directed from the parent to the child.



<document id="0">
  <person id="1" age="55" spouse="2">
    <name>Peter</name>
    <addr>Parklane 7</addr>
    <child>
      <person id="3" age="22">
        <name>John</name>
        <addr>Unistreet 1</addr>
        <hobby>swimming</hobby>
        <hobby>cycling</hobby>
      </person>
    </child>
  </person>
  <person id="2" age="43">
    <name>Mary</name>
    <addr>Parklane 7</addr>
    <hobby>painting</hobby>
  </person>
</document>




Figure 1: A fragment of an XML document and its DT representation.



Example 1. Figure 1 shows a fragment of an XML document and its DT
representation.

This data model closely mimics the XML data model as illustrated in the 
next example. We remark, however, the following differences: 

• order: Siblings are not ordered. This is not crucial, as an ordering can 
be simulated by using a skewed binary DT. 

• attributes: Attributes, like elements, are represented by edges labeled
by the name of the attribute (prefixed with @). The difference is that
in this data model an element may contain several attributes with the same
name.

• labels: Labels represent not only tag names and attribute names, but 
also values and text. 

• text: Unlike in XML, it is possible for several text edges to be adjacent 
to each other. 
A label path is a string of the form $l_1/\ldots/l_m$ with $m \geq 0$ and every $l_i$ an
edge label in $\mathcal{L}$. Given a path $p = ((n_1, l_1, n_2), \ldots, (n_m, l_m, n_{m+1}))$ in a graph
$G$, the label path of $p$, denoted $\lambda_T(p)$ (or $\lambda(p)$ when $T$ is subsumed), is the string
$l_1/\ldots/l_m$.

Processes working on document trees do so in the context of a general pro- 
gramming language that includes an interface to a document server which man- 
ages transactions on documents. The process generates a list of operations that 
will access the document. In general there are three types of operations: the 
query, the addition and the deletion. The input to a query operation will be a 
node and a simple type of path expression, while the result of the invocation of a 
query operation will be a set of nodes. The programming language includes the 
concepts of sets, and has constructs to iterate over their entire contents. The 
input to an addition or a deletion will be an edge. The result of an addition or 
a deletion will be a simple transformation of the original tree into a new tree. 
If the result would not be a tree anymore it is not defined. 

We now define the path expressions and the query operations, subsuming a 
given dt T . 

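The syntax of path expressions$^1$ is given by $\mathcal{P}$:

\[
\begin{array}{rcl}
\mathcal{P} & ::= & pe_\epsilon \mid \mathcal{P}^+ \\
\mathcal{P}^+ & ::= & \mathcal{F} \mid \mathcal{P}^+/\mathcal{F} \mid \mathcal{P}^+//\mathcal{F} \\
\mathcal{F} & ::= & * \mid \mathcal{L}
\end{array}
\]

The set $L(pe)$ of label paths represented by a path expression $pe$ is defined
as follows:

\[
\begin{array}{rcl}
L(pe_\epsilon) & = & \{\epsilon\} \\
L(*) & = & \mathcal{L} \\
L(l) & = & \{l\} \\
L(pe/f) & = & L(pe) \cdot \{/\} \cdot L(f) \\
L(pe//f) & = & L(pe) \cdot \{/\} \cdot (\mathcal{L} \cdot \{/\})^* \cdot L(f)
\end{array}
\]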
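The definition of $L(pe)$ can be turned into a small membership test. The following is a minimal sketch, not the authors' implementation: it assumes path expressions are written as strings such as `a//b/*` (the empty string standing for $pe_\epsilon$), that label paths are Python lists of labels, and that no label is literally named `*`. The function names `parse` and `in_language` are hypothetical.

```python
import re


def parse(pe):
    """Split 'a//b/*' into [('/', 'a'), ('//', 'b'), ('/', '*')]; '' is pe_epsilon."""
    if pe == "":
        return []
    tokens = re.split(r"(//|/)", pe)
    steps = [("/", tokens[0])]
    for i in range(1, len(tokens), 2):
        steps.append((tokens[i], tokens[i + 1]))
    return steps


def in_language(label_path, pe):
    """Return True iff the label path (a list of labels) belongs to L(pe)."""
    steps = parse(pe)

    def match(i, j):
        # Can steps[i:] consume exactly label_path[j:]?
        if i == len(steps):
            return j == len(label_path)
        sep, test = steps[i]
        # A '//' separator may skip zero or more arbitrary labels first.
        starts = range(j, len(label_path)) if sep == "//" else [j]
        for k in starts:
            if (k < len(label_path)
                    and test in ("*", label_path[k])
                    and match(i + 1, k + 1)):
                return True
        return False

    return match(0, 0)
```

For instance, `in_language(["a", "x", "y", "b"], "a//b")` holds because `//` may skip the labels `x` and `y`, whereas the same label path is not in the language of `a/b`.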

Let n be an arbitrary node of T and pe a path expression. We now define the 
three kinds of operations: the query, the addition and the deletion. 

Definition 2. The query operation query(n, pe) returns a set of nodes, and is
defined as follows:

• query(n, pe) with $n \in \mathcal{N}$ and $pe \in \mathcal{P}$. The result of a query on a DT $T$ is
defined as $query(n, pe)[T] = \{ n' \in N \mid \exists p \text{ a path in } T \text{ from } n \text{ to } n' \text{ with } \lambda(p) \in L(pe) \}$.

The update operations add(n, l, n') and del(n, l, n') return no value but
transform a DT $T = (N, E, r)$ into a new DT $T' = (N', E', r)$:

1 Remark that path expressions form a subset of XPath expressions. 
• add(n, l, n') with $n, n' \in \mathcal{N}$ and $l \in \mathcal{L}$. The resulting $T' = add(n, l, n')[T]$
is defined by $E' = E \cup \{(n, l, n')\}$ and $N' = N \cup \{n'\}$. If the resulting $T'$
is not a document tree anymore or $(n, l, n')$ was already in the document
tree then the operation is undefined.

• del(n, l, n') with $n, n' \in \mathcal{N}$ and $l \in \mathcal{L}$. The resulting $T' = del(n, l, n')[T]$
is defined by $E' = E - \{(n, l, n')\}$ and $N' = N - \{n'\}$. If the resulting $T'$
is not a document tree anymore or $(n, l, n')$ was not in the document tree
then the operation is undefined.
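The two update operations can be sketched as methods on a small tree class. This is an illustrative sketch under one stated assumption: "the result is still a document tree" is read as "the parent exists and the child is a fresh node" for an addition, and as "the edge exists and the child is a leaf" for a deletion. The class name `DocumentTree` and the use of `ValueError` to signal an undefined operation are choices made here, not part of the paper.

```python
class DocumentTree:
    """A DT (N, E, r): edges are (parent, label, child) triples."""

    def __init__(self, nodes, edges, root):
        self.nodes, self.edges, self.root = set(nodes), set(edges), root

    def add(self, n, label, child):
        # Assumed reading of "still a document tree": the parent must exist
        # and the child must be a fresh node (otherwise it would get a second
        # parent, or the edge would already be present).
        if n not in self.nodes or child in self.nodes:
            raise ValueError("add is undefined on this tree")
        self.nodes.add(child)
        self.edges.add((n, label, child))

    def delete(self, n, label, child):
        # The edge must exist and the child must be a leaf, since removing an
        # inner node would leave dangling edges behind.
        is_leaf = all(parent != child for (parent, _, _) in self.edges)
        if (n, label, child) not in self.edges or not is_leaf:
            raise ValueError("del is undefined on this tree")
        self.nodes.remove(child)
        self.edges.remove((n, label, child))
```

Under this reading, deleting the edge to a node that still has children is undefined, which is the kind of interference that the consistency conditions of Section 3 are about.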

Note that the operations explicitly contain the nodes upon which they work. 
As we will explain in Section 4 this is justified by the fact that the scheduler 
decides at run time whether an operation is accepted or not. 

We now give some straightforward definitions of schedules and their seman- 
tics. 

Definition 3. An action is a pair $(o, t)$, where $o$ is one of the three operations
query(n, pe), add(n, l, n') and del(n, l, n') and $t$ is a transaction identifier. A
transaction is a sequence of actions with the same transaction identifier. A
schedule over a set of transactions is an interleaving of these transactions. The
size $n_S$ of a schedule $S$ is the length of its straightforward encoding on a Turing
tape.$^2$

We can apply a schedule S on a DT T. The result of such an application is

• for each query in S, the result of this query;

• the DT that results from the sequential application of the actions of S; this
DT is denoted by S[T].

If some of these actions are undefined the application is undefined. Two schedules
are equivalent iff they are defined on the same non-empty set of DTs and on
each of these DTs both schedules have the same result. The definition of serial
and serializable schedules is straightforward.

Since a transaction is a special case of a schedule all the definitions on 
schedules also apply on transactions. 

Note that the equivalence of schedules and transactions is a
document-independent definition. Let
$T_1 = (\{n_1, n_2\}, \{(n_1, l_2, n_2)\}, n_1)$,
$T_2 = (\{n_1, n_2\}, \{(n_1, l_1, n_2)\}, n_1)$,
$T_3 = (\{n_1\}, \emptyset, n_1)$ be three DTs and let

$S_1 = (add(n_2, l_2, n_3), t_1), (query(n_1, l_1/l_2), t_2)$,
$S_2 = (query(n_1, l_1/l_2), t_2), (add(n_2, l_2, n_3), t_1)$ be two schedules.
$S_1$ and $S_2$ are equivalent on $T_1$, they are not equivalent on $T_2$ and their
application is undefined on $T_3$.
Let $S_3$ be the empty schedule and
$S_4 = (add(n_1, l_1, n_2), t_1), (del(n_1, l_1, n_2), t_2)$.

$S_3$ and $S_4$ are not equivalent although they are equivalent on many DTs.

2 We assume that nodes can be encoded in O(1)-space.
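The behaviour of two small schedules on the three trees $T_1$, $T_2$ and $T_3$ can be replayed with a short simulation. The sketch below is only illustrative: it assumes actions are encoded as `(op, args, tid)` tuples, supports only `/`-separated path expressions with `*` (no `//`), and treats an addition or deletion as undefined under the reading "parent exists and child is fresh" and "edge exists and child is a leaf". The helper names `apply` and `equivalent_on` are hypothetical.

```python
def apply(schedule, tree):
    """Apply a schedule to a DT given as (nodes, edges, root).

    Returns (query_results, final_edges), or None when some action is
    undefined. Query results are keyed by (transaction id, position of the
    action inside that transaction).
    """
    nodes, edges = set(tree[0]), set(tree[1])
    results, position = {}, {}
    for op, args, tid in schedule:
        position[tid] = position.get(tid, 0) + 1
        if op == "query":
            start, pe = args
            frontier = {start} & nodes
            for test in (pe.split("/") if pe else []):
                frontier = {c for (p, l, c) in edges
                            if p in frontier and test in ("*", l)}
            results[(tid, position[tid])] = frozenset(frontier)
            continue
        parent, label, child = args
        if op == "add":
            if parent not in nodes or child in nodes:
                return None
            nodes.add(child)
            edges.add((parent, label, child))
        else:  # "del"
            has_children = any(p == child for (p, _, _) in edges)
            if (parent, label, child) not in edges or has_children:
                return None
            nodes.remove(child)
            edges.remove((parent, label, child))
    return results, frozenset(edges)


def equivalent_on(s1, s2, tree):
    """Both schedules are defined on the tree and give the same result."""
    r1, r2 = apply(s1, tree), apply(s2, tree)
    return r1 is not None and r1 == r2


S1 = [("add", ("n2", "l2", "n3"), "t1"), ("query", ("n1", "l1/l2"), "t2")]
S2 = [("query", ("n1", "l1/l2"), "t2"), ("add", ("n2", "l2", "n3"), "t1")]
T1 = ({"n1", "n2"}, {("n1", "l2", "n2")}, "n1")
T2 = ({"n1", "n2"}, {("n1", "l1", "n2")}, "n1")
T3 = ({"n1"}, set(), "n1")
```

On $T_2$ the query sees the new node $n_3$ only when it runs after the addition, which is why the two schedules differ there; on $T_3$ the addition has no parent node to attach to.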

We will later on use the definition of equivalence to define serializability. In
this paper we study view serializability, which is less restrictive than conflict
serializability. We illustrate this claim by informally introducing a scheduling
mechanism for generating conflict serializable schedules. A possible approach
for this is to have a locking mechanism where operations can get locks, and in
which a new operation of a certain process will only be allowed if it does not
require locks that conflict with locks required by earlier operations. Because
operations with non-conflicting locks can be commuted, any schedule that is
allowed by such a scheduler can be serialized. The following example shows,
however, that the reverse does not hold. Indeed, the next schedule

$S = (add(r, l_1, n_1), t_1), (del(r, l_1, n_1), t_2),$
$(add(r, l_2, n_2), t_2), (del(r, l_2, n_2), t_2),$
$(add(r, l_2, n_2), t_1), (del(r, l_2, n_2), t_1)$

is consistent since it is defined on $T = (\{r\}, \emptyset, r)$. Furthermore it is
serializable, and the equivalent serial schedules are

$S_1 = (add(r, l_1, n_1), t_1), (add(r, l_2, n_2), t_1),$
$(del(r, l_2, n_2), t_1), (del(r, l_1, n_1), t_2),$
$(add(r, l_2, n_2), t_2), (del(r, l_2, n_2), t_2)$

$S_2 = (del(r, l_1, n_1), t_2), (add(r, l_2, n_2), t_2),$
$(del(r, l_2, n_2), t_2), (add(r, l_1, n_1), t_1),$
$(add(r, l_2, n_2), t_1), (del(r, l_2, n_2), t_1)$

but we cannot go from $S$ to $S_1$ nor to $S_2$ only by swapping with consistent
intermediate schedules. This illustrates that an approach based on conflict
serializability can be too strict.

3 Consistency of Queryless Schedules 

A schedule is called queryless (QL) iff it contains no queries. Because of the way 
that operations can fail it is possible that the application of a certain transaction 
is not defined for any document tree. We are not interested in such transactions. 
We call a transaction t consistent iff there is at least one DT T with t[T] defined.

Example 2. The next transaction is consistent:

$(add(r, l_1, n_1), t_1), (del(r, l_1, n_1), t_1), (add(r, l_2, n_2), t_1),$
$(del(r, l_2, n_2), t_1), (add(r, l_2, n_2), t_1), (del(r, l_2, n_2), t_1).$

Note, however, that there are DTs on which this transaction is undefined. For
example, if $T$ contains an edge $(r, l_3, n_1)$, then $t_1[T]$ is undefined, since the
application of the first action of $t_1$ is undefined.
The next transaction is inconsistent:

$(add(n_1, l_1, n), t_1), (add(n_2, l_2, n), t_1).$
We call a schedule S consistent iff there is at least one DT T with S[T]
defined. Remark that there are consistent schedules that cannot be serializable
because they contain an inconsistent transaction. For instance, the schedule
$S = (add(r, l_1, n_1), t_1), (del(r, l_1, n_1), t_2), (add(r, l_1, n_1), t_1)$ is consistent, since it
is defined on $T = (\{r\}, \emptyset, r)$, but it is not serializable, because every serial
QL schedule over the same transactions would be undefined (since the transaction
$t_1$ is not consistent). Transaction $t_1$ has the property that all QL schedules over
a set of transactions that contains $t_1$ are non-serializable.

Note that the definition of consistent QL schedule is document-independent. 
It is clear that we are only interested in consistent transactions and schedules. 
Remark also that if two QL schedules are equivalent then they are both consis- 
tent. This equivalence relation is defined on the set of consistent QL schedules. 

We will characterize the consistent QL schedules and prove that this
property is decidable. For this purpose we will first attempt to characterize for which
document trees a given consistent QL schedule S is defined, and what the
properties are of the document trees that result from a QL schedule. We do this by
defining the sets $N_I^{min}(S)$, $N_I^{max}(S)$, $E_I^{min}(S)$ and $E_I^{max}(S)$, whose informal
meaning is respectively the set of nodes that are required in the input DTs on
which S is defined, the set of nodes that are allowed, the set of edges that are
required and the set of edges that are allowed. In the same way we define the
sets $N_O^{min}(S)$, $N_O^{max}(S)$, $E_O^{min}(S)$ and $E_O^{max}(S)$ taking into account the output
DTs.

Definition 4. Let S be a QL schedule. $\phi_S(n, o)$ ($\phi_S((m, l, n), o)$) indicates
that the first occurrence of the node $n$ (the edge $(m, l, n)$) in the schedule S has
the form of the operation $o$.$^3$ $\lambda_S(n, o)$ ($\lambda_S((m, l, n), o)$) indicates that the last
occurrence of the node $n$ (the edge $(m, l, n)$) in the QL schedule S has the form of
the operation $o$. We define the sets $N_I^{min}(S)$, $N_I^{max}(S)$, $E_I^{min}(S)$ and $E_I^{max}(S)$,
and the sets $N_O^{min}(S)$, $N_O^{max}(S)$, $E_O^{min}(S)$ and $E_O^{max}(S)$ as in Figure 2.
A DT T is called a basic input tree (basic output tree) of S iff it contains all the
nodes of $N_I^{min}(S)$ ($N_O^{min}(S)$), only nodes of $N_I^{max}(S)$ ($N_O^{max}(S)$), all the edges
of $E_I^{min}(S)$ ($E_O^{min}(S)$) and only edges of $E_I^{max}(S)$ ($E_O^{max}(S)$).

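Consider $S = (add(n_1, l_1, n_2), t_1), (del(n_4, l_2, n_3), t_2), (del(n_1, l_1, n_4), t_3)$; then

\[
\begin{array}{rcl}
N_I^{min}(S) & = & \{n_1, n_3, n_4\} \\
N_I^{max}(S) & = & \mathcal{N} - \{n_2\} \\
E_I^{min}(S) & = & \{(n_4, l_2, n_3), (n_1, l_1, n_4)\} \\
E_I^{max}(S) & = & E_I^{min}(S) \cup \{(m, l, n) \in \mathcal{N} \times \mathcal{L} \times \mathcal{N} \mid m, n \neq n_2, n_3, n_4\} \\
N_O^{min}(S) & = & \{n_1, n_2\} \\
N_O^{max}(S) & = & \mathcal{N} - \{n_3, n_4\} \\
E_O^{min}(S) & = & \{(n_1, l_1, n_2)\} \\
E_O^{max}(S) & = & E_O^{min}(S) \cup \{(m, l, n) \in \mathcal{N} \times \mathcal{L} \times \mathcal{N} \mid m, n \neq n_2, n_3, n_4\}
\end{array}
\]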

3 For example, $\phi_S(n_2, add(r, l_2, n_2))$ holds in the consistent QL schedule in Example 2
above.
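\[
\begin{array}{rcl}
N_I^{min}(S) & = & \{m \mid \phi_S(m, add(m, l, n))\} \cup \{m \mid \phi_S(m, del(m, l, n))\} \cup \{n \mid \phi_S(n, del(m, l, n))\} \\
N_I^{max}(S) & = & \mathcal{N} - \{n \mid \phi_S(n, add(m, l, n))\} \\
E_I^{min}(S) & = & \{(m, l, n) \mid \phi_S((m, l, n), del(m, l, n))\} \\
E_I^{max}(S) & = & E_I^{min}(S) \cup \{(m, l, n) \mid \text{no } (m_1, l_1, m) \text{ nor } (m_1, l_1, n) \text{ occurs in } S\} \\
N_O^{min}(S) & = & \{m \mid \lambda_S(m, del(m, l, n))\} \cup \{m \mid \lambda_S(m, add(m, l, n))\} \cup \{n \mid \lambda_S(n, add(m, l, n))\} \\
N_O^{max}(S) & = & \mathcal{N} - \{n \mid \lambda_S(n, del(m, l, n))\} \\
E_O^{min}(S) & = & \{(m, l, n) \mid \lambda_S((m, l, n), add(m, l, n))\} \\
E_O^{max}(S) & = & E_O^{min}(S) \cup \{(m, l, n) \mid \text{no } (m_1, l_1, m) \text{ nor } (m_1, l_1, n) \text{ occurs in } S\}
\end{array}
\]

Figure 2: The definition of the basic input and output sets.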

We will prove in Theorem 1 that the application of a consistent schedule S
is defined on each basic input tree of S.

Although $N_I^{max}(S)$, $E_I^{max}(S)$, $N_O^{max}(S)$ and $E_O^{max}(S)$ are in general infinite,
they can be represented in a finite way: $N_I^{max}(S)$ by $\{n \mid \phi_S(n, add(m, l, n))\}$,
$E_I^{max}(S)$ by $E_I^{min}(S) \cup \{n \mid \text{there is a } (m_1, l_1, n) \text{ that occurs in } S\}$, $N_O^{max}(S)$ by
$\{n \mid \lambda_S(n, del(m, l, n))\}$, and $E_O^{max}(S)$ by $E_O^{min}(S) \cup \{n \mid \text{there is a } (m_1, l_1, n) \text{ that occurs in } S\}$.
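The finite representation suggests a single pass over the schedule that records, for every node and edge, its first and last occurrence. The following is a minimal sketch under assumptions made here: a QL schedule is a list of `(op, m, l, n)` tuples (transaction identifiers are dropped because they do not influence the sets), and the infinite "max" sets are represented by the finite sets of excluded nodes. The function name `basic_sets` and the dictionary keys are hypothetical.

```python
def basic_sets(schedule):
    """Finite representation of the basic input and output sets.

    schedule: list of (op, m, l, n) with op in {"add", "del"}.
    """
    first_node, last_node, first_edge, last_edge = {}, {}, {}, {}
    for op, m, l, n in schedule:
        for node, role in ((m, "parent"), (n, "child")):
            first_node.setdefault(node, (op, role))
            last_node[node] = (op, role)
        first_edge.setdefault((m, l, n), op)
        last_edge[(m, l, n)] = op
    # Nodes first seen as the child of an add must be absent from the input;
    # nodes last seen as the child of a del are absent from the output.
    excluded_in = {x for x, occ in first_node.items() if occ == ("add", "child")}
    excluded_out = {x for x, occ in last_node.items() if occ == ("del", "child")}
    return {
        "N_I_min": set(first_node) - excluded_in,
        "N_I_excluded": excluded_in,      # N_I_max is the complement of this set
        "E_I_min": {e for e, op in first_edge.items() if op == "del"},
        "N_O_min": set(last_node) - excluded_out,
        "N_O_excluded": excluded_out,     # N_O_max is the complement of this set
        "E_O_min": {e for e, op in last_edge.items() if op == "add"},
        "children": {n for (_, _, _, n) in schedule},  # constrains the E max sets
    }


example = [("add", "n1", "l1", "n2"),
           ("del", "n4", "l2", "n3"),
           ("del", "n1", "l1", "n4")]
sets = basic_sets(example)
```

For this three-action schedule the sketch yields the required input nodes `n1`, `n3`, `n4` and the required output nodes `n1`, `n2`. It uses dictionaries rather than the sorted structures that a bound of the form $O(n_S \cdot \log(n_S))$ suggests, so it illustrates the definitions rather than the stated complexity.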

Lemma 1. Let S be a schedule with size $n_S$. $N_I^{min}(S)$, $N_I^{max}(S)$, $E_I^{min}(S)$, $E_I^{max}(S)$,
$N_O^{min}(S)$, $N_O^{max}(S)$, $E_O^{min}(S)$ and $E_O^{max}(S)$ can be calculated in $O(n_S \cdot \log(n_S))$-time
and in $O(n_S)$-space. For each of these sets and for any node or edge it is
decidable in $O(n_S)$-time and $O(\log(n_S))$-space whether the node or edge is in
the set.

Proof. (Sketch) We can decide whether a node or an edge is in one of the basic 
input or output sets by examining the actions of the schedule S. □ 

When a QL schedule is inconsistent this is always because two operations in
the QL schedule interfere, as for example the two operations in the inconsistent
transaction of Example 2: $(add(n_1, l_1, n), t_1)$ and $(add(n_2, l_2, n), t_1)$. If these
two operations immediately follow each other then at least one of them will
always fail. However, if between them we find the action $del(n_1, l_1, n)$ then this
no longer holds. The following definition attempts to identify such pairs of
interfering operations and states which operations we should find between them
to remove the interference.

Definition 5. A QL schedule S fulfills the C-condition iff

1. If $add(n, l_1, n_1)$ and $add(n_2, l_2, n)$ appear in that order in S then $del(n, l_1, n_1)$
appears between them.

2. If $add(n_1, l_1, n)$ and $add(n_2, l_2, n)$ appear in that order in S then $del(n_1, l_1, n)$
appears between them.

3. If $add(n, l_1, n_1)$ and $del(n_2, l_2, n)$ appear in that order in S then $del(n, l_1, n_1)$
appears between them.

4. If $add(n_1, l_1, n)$ and $del(n, l_2, n_2)$ appear in that order in S then $add(n, l_2, n_2)$
appears between them.

5. If $add(n_1, l_1, n)$ and $del(n_2, l_2, n)$ appear in that order in S and $(n_1, l_1) \neq (n_2, l_2)$
then $del(n_1, l_1, n)$ appears between them.

6. If $del(n, l_1, n_1)$ and $add(n_2, l_2, n)$ appear in that order in S then some
$del(n_3, l_3, n)$ appears between them.

7. If $del(n_1, l_1, n)$ and $add(n, l_2, n_2)$ appear in that order in S then some
$add(n_3, l_3, n)$ appears between them.

8. If $del(n_1, l_1, n)$ and $del(n, l_2, n_2)$ appear in that order in S then some
$add(n_3, l_3, n)$ appears between them.

9. If $del(n_1, l_1, n)$ and $del(n_2, l_2, n)$ appear in that order in S then $add(n_2, l_2, n)$
appears between them.

The following theorem establishes the relationship between consistency, basic 
input trees and the C-condition. 

Theorem 1. The following conditions are equivalent for a QL schedule S: 

1. there is a basic input tree of S and the application of S is defined on each 
basic input tree of S. 

2. there is a basic input tree of S on which the application of S is defined; 

3. S is consistent; 

4. S fulfills the C-condition;

5. there is a tree on which the application of S is defined and all trees on 
which the application of S is defined are basic input trees of S. 

Proof. (Sketch) Clearly $1 \Rightarrow 2 \Rightarrow 3 \Rightarrow 4$ and $5 \Rightarrow 3$. We prove that 4 implies
1. First we prove that there is a basic input tree for which S is defined. Then
we prove that the application of S is defined on each basic input tree of S by
induction on the length of S. Finally 3 implies 5. Indeed, let S be defined on
T, where T is not a basic input tree of S. T does not satisfy one of the four
conditions of Definition 4. In each case this yields a contradiction. □

Corollary 1. It is decidable whether a QL schedule or a transaction is
consistent in $O(n_S^3)$-time and $O(n_S)$-space.

Proof. (Sketch) This follows from the decidability of the C-condition and
Theorem 1. □
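A direct way to test the C-condition is to examine every ordered pair of actions and look for the required action strictly between them, which gives the cubic behaviour mentioned in the corollary. The sketch below is a hedged illustration, not the authors' algorithm: it assumes a QL schedule is a list of `(op, m, l, n)` tuples and encodes the nine rules as read from the definition above; the names `required_between` and `fulfills_c_condition` are hypothetical.

```python
def required_between(a, b):
    """Yield predicates; some action strictly between a and b must satisfy each."""
    (op_a, am, al, an), (op_b, bm, bl, bn) = a, b
    if op_a == "add" and op_b == "add":
        if am == bn or an == bn:                      # rules 1 and 2
            yield lambda x: x == ("del", am, al, an)
    elif op_a == "add" and op_b == "del":
        if am == bn:                                  # rule 3
            yield lambda x: x == ("del", am, al, an)
        if an == bm:                                  # rule 4
            yield lambda x: x == ("add", bm, bl, bn)
        if an == bn and (am, al) != (bm, bl):         # rule 5
            yield lambda x: x == ("del", am, al, an)
    elif op_a == "del" and op_b == "add":
        if am == bn:                                  # rule 6
            yield lambda x: x[0] == "del" and x[3] == am
        if an == bm:                                  # rule 7
            yield lambda x: x[0] == "add" and x[3] == an
    else:  # del followed by del
        if an == bm:                                  # rule 8
            yield lambda x: x[0] == "add" and x[3] == an
        if an == bn:                                  # rule 9
            yield lambda x: x == ("add", bm, bl, bn)


def fulfills_c_condition(schedule):
    """Check all ordered pairs of actions; cubic in the number of actions."""
    for j in range(len(schedule)):
        for i in range(j):
            for pred in required_between(schedule[i], schedule[j]):
                if not any(pred(x) for x in schedule[i + 1:j]):
                    return False
    return True


consistent = [("add", "r", "l1", "n1"), ("del", "r", "l1", "n1"),
              ("add", "r", "l2", "n2"), ("del", "r", "l2", "n2"),
              ("add", "r", "l2", "n2"), ("del", "r", "l2", "n2")]
inconsistent = [("add", "n1", "l1", "n"), ("add", "n2", "l2", "n")]
```

Here `fulfills_c_condition(consistent)` holds for the consistent transaction of Example 2, while the two additions of the same child node in `inconsistent` violate rule 2.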

For the basic input and output sets we can derive the following property: 

Property 1. If S is a consistent QL schedule then $E_I^{min}(S)$ and $E_O^{min}(S)$ are
forests.
By ADD(S) we denote the set of edges that are added by the QL schedule
S, i.e., they are added without being removed again afterwards, and by DEL(S)
we denote the set of edges that are deleted by the QL schedule S, i.e., they are
deleted without being added again afterwards.

Definition 6. Let S be a consistent QL schedule. We denote 

$ADD(S) = \{(m, l, n) \mid \lambda_S((m, l, n), add(m, l, n))\}$
$DEL(S) = \{(m, l, n) \mid \lambda_S((m, l, n), del(m, l, n))\}$

We call ADD(S) the addition set of S and DEL(S) its deletion set. 

Remark that two consistent QL schedules with the same ADD and DEL are
not necessarily equivalent. Indeed $S_1 = (del(n_1, l_1, n_2), t_2)$ and
$S_2 = (add(n_1, l_1, n_2), t_1), (del(n_1, l_1, n_2), t_2)$ are not equivalent although
$ADD(S_1) = ADD(S_2)$ and $DEL(S_1) = DEL(S_2)$.

Lemma 2. Let S be a consistent QL schedule and T be a basic input tree of S.
$S[T] = T \cup ADD(S) - DEL(S)$ is a basic output tree$^4$.

Proof. (Sketch) Clearly T U ADD(S) - DEL(S) is the result of the application 
of S on T. We verify that T U ADD(S) - DEL(S) is a basic output tree. □ 
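The addition and deletion sets only depend on the last occurrence of each edge, so they can be collected in one pass. The sketch below assumes a QL schedule is a list of `(op, m, l, n)` tuples and a tree is given by its edge set, in line with the convention of treating a graph as the set of its edges; `add_del_sets` and `result_edges` are hypothetical names, and `result_edges` is only meaningful when the tree is a basic input tree of a consistent schedule.

```python
def add_del_sets(schedule):
    """Return (ADD(S), DEL(S)) based on the last occurrence of every edge."""
    last = {}
    for op, m, l, n in schedule:
        last[(m, l, n)] = op
    added = {e for e, op in last.items() if op == "add"}
    deleted = {e for e, op in last.items() if op == "del"}
    return added, deleted


def result_edges(tree_edges, schedule):
    """Edges of S[T] = T u ADD(S) - DEL(S), assuming T is a basic input tree."""
    added, deleted = add_del_sets(schedule)
    return (set(tree_edges) | added) - deleted


S1 = [("del", "n1", "l1", "n2")]
S2 = [("add", "n1", "l1", "n2"), ("del", "n1", "l1", "n2")]
```

Both `S1` and `S2` have an empty addition set and the deletion set containing only the edge from `n1` to `n2`, although the first requires that edge in the input and the second forbids it. This matches the remark that equal ADD and DEL sets do not imply equivalence.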

The following lemma establishes the relationships between the addition and 
deletion sets, and the basic input and output sets. 

Lemma 3. Let S be a consistent QL schedule. 

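\[
\begin{array}{rcl}
N_O^{min}(S) & = & (N_I^{min}(S) - \{n \mid \exists (m, l, n) \in DEL(S)\}) \cup \{n \mid \exists (m, l, n) \in ADD(S)\} \\
N_O^{max}(S) & = & (N_I^{max}(S) - \{n \mid \exists (m, l, n) \in DEL(S)\}) \cup \{n \mid \exists (m, l, n) \in ADD(S)\} \\
E_O^{min}(S) & = & (E_I^{min}(S) - DEL(S)) \cup ADD(S) \\
E_O^{max}(S) & = & (E_I^{max}(S) - DEL(S)) \cup ADD(S)
\end{array}
\]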

Proof. (Sketch) Results from Lemma 2. □

4 Equivalence and Serializability of QL Schedules

The purpose of a scheduler is to interleave requests by processes such that the 
resulting schedule is serializable. This can be done by deciding for each request 
whether the schedule extended with the requested operation is still serializable, 
without looking at the instance. In this section we discuss the problem of 
deciding whether two consistent QL schedules are equivalent, and whether a 
consistent QL schedule is serializable. 

To begin with, it can be shown that the application of two QL schedules over
the same set of transactions on the same DT T results in the same DT, if they
are both defined.

4 We consider a graph as the set of its edges and vice versa. 
Lemma 4. Let S and S' be two QL schedules over the same set of transactions.
S[T] = S'[T] if S[T] and S'[T] are both defined.

Proof. (Sketch) Considering a given edge, this edge is alternately added and
deleted in each of the applications. Since the two QL schedules are over the
same set of transactions, the edge belongs to no result or to both results. □

As a consequence the problem of deciding whether two consistent schedules 
over two given transactions are equivalent reduces to the problem of deciding 
whether their result is defined for the same dts, which can be decided with the 
help of the basic input and output sets. 

Theorem 2. Two consistent QL schedules $S_1$, $S_2$ over the same set of
transactions are equivalent iff they have the same set of basic input trees, i.e., iff
$N_I^{min}(S_1) = N_I^{min}(S_2)$, $N_I^{max}(S_1) = N_I^{max}(S_2)$, $E_I^{min}(S_1) = E_I^{min}(S_2)$ and
$E_I^{max}(S_1) = E_I^{max}(S_2)$. Hence their equivalence is decidable in $O(n_S \cdot \log(n_S))$-time
and $O(n_S)$-space.

Proof. From Lemma ^and Lemma QJ □ 

Note that this theorem does not hold for two arbitrary QL schedules. Indeed
$S_1 = (add(m, l, n), t)$ and $S_2 = (add(m, l, n), t), (del(m, l, n), t)$ have the same
basic input trees and are not equivalent.

We can use the basic input and output sets to decide whether one consistent 
schedule can directly follow another consistent schedule without resulting in an 
inconsistent schedule. 

Lemma 5. Let $S_1$ and $S_2$ be two consistent QL schedules. Let $n_S$ be the size of
$S_1.S_2$. $S_1.S_2$ is consistent iff $N_I^{min}(S_2) \subseteq N_O^{max}(S_1)$, $E_I^{min}(S_2) \subseteq E_O^{max}(S_1)$,
$N_O^{min}(S_1) \subseteq N_I^{max}(S_2)$ and $E_O^{min}(S_1) \subseteq E_I^{max}(S_2)$. The consistency of $S_1.S_2$ is
decidable in $O(n_S \cdot \log(n_S))$-time and $O(n_S)$-space.

Proof. (Sketch) A result of the C-conditions, Lemma ^ and Theorem 

□ 

The following lemma shows how the basic input and output sets can be com- 
puted for a concatenation of schedules if we know these sets for the concatenated 
schedules. 

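Lemma 6. Let $S_1, S_2, \ldots, S_n$ and $S_1.S_2 \ldots S_n$ be $(n + 1)$ consistent QL schedules.
Then

\[
\begin{array}{rcl}
N_I^{min}(S_1 \ldots S_n) & = & \bigcup_{i=1}^{n} \left( N_I^{min}(S_i) \cap \bigcap_{k<i} N_I^{max}(S_k) \right) \\
N_I^{max}(S_1 \ldots S_n) & = & \bigcap_{i=1}^{n} \left( N_I^{max}(S_i) \cup \bigcup_{k<i} N_I^{min}(S_k) \right) \\
E_I^{min}(S_1 \ldots S_n) & = & \bigcup_{i=1}^{n} \left( E_I^{min}(S_i) \cap \bigcap_{k<i} E_I^{max}(S_k) \right) \\
E_I^{max}(S_1 \ldots S_n) & = & \bigcap_{i=1}^{n} \left( E_I^{max}(S_i) \cup \bigcup_{k<i} E_I^{min}(S_k) \right)
\end{array}
\]

If $n_S$ is the size of $S_1.S_2 \ldots S_n$ then these equalities can be verified in $O(n_S)$-time
and $O(n_S)$-space.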



Proof. By induction using Definition 4.



□ 
Finally, the previous theorems can be used to show that serializability is
decidable.

Theorem 3. Let S be a QL schedule of $n_t$ transactions. It is decidable whether
S is serializable in $O(n_t^{n_t} \cdot n_S^3)$-time, and in $O(n_S^2)$-space.

Proof. (Sketch) Indeed,

1. we verify whether each transaction is consistent, which is done in $O(n_S^3 \cdot n_t)$-time
and in $O(n_S)$-space (Corollary 1);

2. we draw a graph that indicates which transactions can directly follow
which other transactions (i.e. $T_i.T_j$ is defined), which is done in
$O(n_t^2 \cdot n_S \cdot \log(n_S))$-time and in $O(n_t^2 + n_S)$-space (Lemma 5);

3. S is serializable iff there is a Hamilton path that is equivalent with S; to
verify this:

(a) we calculate the ordered $N_I^{min}$, $N_I^{max}$, $E_I^{min}$ and $E_I^{max}$ of the
transactions, which is done in $O(n_t \cdot n_S \cdot \log(n_S))$-time and $O(n_t \cdot n_S)$-space
(Lemma 1);

(b) there are $O(n_t^{n_t})$ Hamilton paths, for each of them:

i. we verify its consistency, which is done in $O(n_S^3)$-time and $O(n_S)$-space
(Corollary 1);

ii. we calculate the ordered $N_I^{min}$, $N_I^{max}$, $E_I^{min}$ and $E_I^{max}$ of the
Hamilton path, which is done in $O(n_S \cdot \log(n_S))$-time and $O(n_S)$-space
(Lemma 1);

iii. Lemma 6 and Theorem 2 are verified in $O(n_S)$-time and in
$O(n_S)$-space.

□ 
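As a sanity check, the bounds of the individual steps can be added up. The derivation below is a sketch that relies on the step bounds as read from the proof sketch above (several of them are hard to read in the source) and on the assumption that $1 \leq n_t \leq n_S$:

\[
\begin{array}{rcl}
\text{time} & = & O(n_S^3 \cdot n_t) + O(n_t^2 \cdot n_S \cdot \log(n_S)) + O(n_t \cdot n_S \cdot \log(n_S)) \\
 & & {} + O(n_t^{n_t}) \cdot \left( O(n_S^3) + O(n_S \cdot \log(n_S)) + O(n_S) \right) \\
 & = & O(n_t^{n_t} \cdot n_S^3), \\
\text{space} & = & O(n_t^2 + n_S) + O(n_t \cdot n_S) + O(n_S) \;=\; O(n_S^2).
\end{array}
\]

The time bound uses $n_t \leq n_t^2 \leq n_t^{n_t}$ for $n_t \geq 1$ and $n_S \cdot \log(n_S) \leq n_S^3$; the space bound uses $n_t \leq n_S$. The factor $n_t^{n_t}$ is a coarse upper bound on the $n_t!$ possible serial orders.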

5 Equivalence and Serializability of Schedules 

In the previous section we only considered QL schedules, but in this section
we consider all schedules. We start with generalizing the notions that were
introduced for QL schedules.

Definition 7. A schedule S is called consistent iff its corresponding QL schedule
S' is consistent. ADD(S) = ADD(S') where S' is the QL schedule of S.
Analogously for DEL, $E_I^{min}$, $E_I^{max}$, $E_O^{min}$, $E_O^{max}$, $N_I^{min}$, $N_I^{max}$, $N_O^{min}$, $N_O^{max}$.

To verify whether two consistent schedules over the same set of transactions
are equivalent, we first eliminate the queries and verify whether the resulting
QL schedules are equivalent (cf. Theorem 2). In this section we investigate
the equivalence of two consistent schedules over the same set of transactions
and whose QL schedules are equivalent. In the following examples it is shown
that such schedules can be equivalent on all the DTs they are defined on, on
only some of them or on none.
Example 3. Let $l_1 \neq l_3$. Consider the following schedules: 

$S_1 = (\mathrm{add}(n_2, l_2, n_3), t_1),\ (\mathrm{query}(n_1, l_1/l_2), t_2),$ 

$(\mathrm{del}(n_2, l_2, n_3), t_1),\ (\mathrm{del}(n_1, l_3, n_2), t_1)$ 

$S_2 = (\mathrm{query}(n_1, l_1/l_2), t_2),\ (\mathrm{add}(n_2, l_2, n_3), t_1),$ 

$(\mathrm{del}(n_2, l_2, n_3), t_1),\ (\mathrm{del}(n_1, l_3, n_2), t_1)$ 

$S_1$ and $S_2$ are correct and their corresponding QL schedules are equal. They 
are equivalent on all DTs on which they are defined, hence they are equivalent. 
Consider the following schedules $S_3$ and $S_4$: 

$S_3 = (\mathrm{add}(n_2, l_2, n_3), t_1),\ (\mathrm{query}(n_1, l_1/l_2), t_2),$ 

$(\mathrm{del}(n_2, l_2, n_3), t_1)$ 

$S_4 = (\mathrm{query}(n_1, l_1/l_2), t_2),\ (\mathrm{add}(n_2, l_2, n_3), t_1),$ 

$(\mathrm{del}(n_2, l_2, n_3), t_1)$ 

$S_3$ and $S_4$ are consistent and their corresponding QL schedules are equal. They 
are equivalent on some DTs on which they are defined and not equivalent on 
others. 

Finally, let $S_5$ and $S_6$ be the following schedules: 

$S_5 = (\mathrm{add}(n_2, l_2, n_3), t_1),\ (\mathrm{query}(n_1, l_1/l_2), t_2),$ 

$(\mathrm{del}(n_2, l_2, n_3), t_1),\ (\mathrm{del}(n_1, l_1, n_2), t_1)$ 

$S_6 = (\mathrm{query}(n_1, l_1/l_2), t_2),\ (\mathrm{add}(n_2, l_2, n_3), t_1),$ 

$(\mathrm{del}(n_2, l_2, n_3), t_1),\ (\mathrm{del}(n_1, l_1, n_2), t_1)$ 

$S_5$ and $S_6$ are consistent and their corresponding QL schedules are equal. They 
are, however, equivalent on no DT on which they are defined. 
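
The three cases of Example 3 can be replayed with a small simulation. This is only a sketch under simplifying assumptions that are not taken from the text: a document tree is a set of edges (parent, label, child), queries are plain label paths, an addition needs an existing parent and a new child, a deletion needs an existing leaf edge, and a failing action makes the schedule undefined on that tree. The helper names (`run`, `same_on`) and the label `x` are illustrative; transaction identifiers are carried along but not used.

```python
def run(schedule, tree):
    """Return (output tree, query answers), or None if some action is not applicable."""
    edges, answers = set(tree), {}
    for action in schedule:
        (op, *args), _transaction = action
        known = {x for p, _, c in edges for x in (p, c)}   # nodes currently in the tree
        if op == "add":
            parent, _label, child = args
            if parent not in known or child in known:
                return None
            edges.add(tuple(args))
        elif op == "del":
            has_children = any(p == args[2] for p, _, _ in edges)
            if tuple(args) not in edges or has_children:
                return None
            edges.remove(tuple(args))
        else:  # query(start, label path)
            start, path = args
            if start not in known:
                return None
            frontier = {start}
            for label in path:
                frontier = {c for p, l, c in edges if p in frontier and l == label}
            answers[action] = frozenset(frontier)
    return frozenset(edges), answers

def same_on(s_a, s_b, tree):
    """True iff both schedules are defined on `tree` and agree on output and answers."""
    r_a, r_b = run(s_a, tree), run(s_b, tree)
    return r_a is not None and r_a == r_b

Q = (("query", "n1", ("l1", "l2")), "t2")
A = (("add", "n2", "l2", "n3"), "t1")
D = (("del", "n2", "l2", "n3"), "t1")
DEL_L3 = (("del", "n1", "l3", "n2"), "t1")
DEL_L1 = (("del", "n1", "l1", "n2"), "t1")

S1, S2 = [A, Q, D, DEL_L3], [Q, A, D, DEL_L3]
S3, S4 = [A, Q, D], [Q, A, D]
S5, S6 = [A, Q, D, DEL_L1], [Q, A, D, DEL_L1]
```

Under these assumptions, `same_on(S3, S4, tree)` depends on whether the input tree contains the edge `("n1", "l1", "n2")`, while `S5` and `S6` can only run on trees that contain this edge and therefore always disagree on the query answer.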

In order to prove the decidability of the equivalence of two schedules over 
the same set of transactions we first define the notion of SOP, Set Of Prefixes 
in Subsection 5.1, and some additional notation in Subsection 5.2. 

5.1 SOP - Set Of Prefixes 

Informally, the notion "Set Of Prefixes" (SOP) of a path expression $pe$ for a 
label path $lp$ will allow us to find a set of path expressions $pe'$, such that all 
path expressions $pe'/lp$ together represent exactly those label paths of $pe$ that 
end on $lp$. For example, consider the path expression $pe = b//*$ and the label 
path $lp = a$. Then $b/a$ and $b//*/a$ represent the label paths of $pe$ that end 
with label path $a$. Hence $b$ and $b//*$ are $a$-prefixes of $b//*$. 

We will now define the set of non-empty $lp$-prefixes in $pe$, denoted as $SOP(pe)_{lp}$, 
as a set of path expressions that together represent the set of label paths $lp'$ 
such that $lp'/lp \in L(pe)$⁵. For instance $SOP(b//*)_{a} = \{b, b//*\}$. 

Definition 8. Let $pe$ be a path expression, $lp$ be a label path and $l \in \mathcal{L}$. The 
set of non-empty $lp$-prefixes in $pe$, denoted as $SOP(pe)_{lp}$, is defined by 

$SOP(pe)_{\epsilon} = \{pe\}$ 

$SOP(pe/*)_{l} = SOP(pe/l)_{l} = \{pe\}$ 

$SOP(pe//*)_{l} = SOP(pe//l)_{l} = \{pe, pe//*\}$ 

$SOP(pe/*)_{lp/l} = SOP(pe/l)_{lp/l} = SOP(pe)_{lp}$ 

$SOP(pe//*)_{lp/l} = SOP(pe//l)_{lp/l} = SOP(pe)_{lp} \cup SOP(pe//*)_{lp}$ 

Otherwise $SOP(pe)_{lp} = \emptyset$. 
Furthermore we define $L(SOP(pe)_{lp}) = \bigcup_{pe' \in SOP(pe)_{lp}} L(pe')$. 

⁵ We consider $pe/\epsilon$ to be equal to $pe$. 

Lemma 7. $L(SOP(pe)_{lp}) = \{lp' \mid lp'/lp \in L(pe)\}$. 

Example 4. 

• $SOP(a/*/*/b)_{a/b} = SOP(a/*/*)_{a} = \{a/*\}$ 

• $SOP(a//*/c)_{a/b/c} = SOP(a//*)_{a/b} = SOP(a)_{a} \cup SOP(a//*)_{a} = \{a, a//*\}$ 

• $SOP(*//*)_{a/b/c} = SOP(*)_{a/b} \cup SOP(*//*)_{a/b} = \emptyset \cup SOP(*)_{a} \cup SOP(*//*)_{a} = \emptyset \cup \emptyset \cup \{*, *//*\} = \{*, *//*\}$ 

• $SOP(a//b//d)_{b/c/d} = \{a, a//*, a//b, a//b//*\}$ 
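
The recursive rules of Definition 8 translate almost literally into code. The sketch below is one possible reading, not an implementation given in the paper: a path expression is assumed to be a tuple of steps (axis, test), with axis `/` or `//` and test a label or `*`, and the helpers `parse`, `show` and `sop_text` are illustrative names. A single-step expression has no non-empty prefix, which is why $SOP(a)_{a}$ comes out empty as in Example 4.

```python
import re

def parse(text):
    """Turn 'a//*/c' into (('/', 'a'), ('//', '*'), ('/', 'c'))."""
    return tuple((sep or "/", test) for sep, test in re.findall(r"(//|/)?([^/]+)", text))

def show(pe):
    """Inverse of parse for non-empty path expressions."""
    return "".join(sep + test for sep, test in pe)[1:]

def sop(pe, lp):
    """Non-empty lp-prefixes of pe; pe is a tuple of steps, lp a tuple of labels."""
    if not lp:                       # SOP(pe)_epsilon = {pe}, kept only if pe is non-empty
        return {pe} if pe else set()
    if not pe:
        return set()
    init, (axis, test) = pe[:-1], pe[-1]
    if test not in ("*", lp[-1]):    # last step cannot produce the last label of lp
        return set()
    found = sop(init, lp[:-1])       # rules for pe/* and pe/l
    if axis == "//":                 # rules for pe//* and pe//l add the pe//* branch
        found |= sop(init + (("//", "*"),), lp[:-1])
    return found

def sop_text(pe_text, lp_text):
    """Convenience wrapper working on strings such as 'a//*/c' and 'a/b/c'."""
    return {show(p) for p in sop(parse(pe_text), tuple(lp_text.split("/")))}
```

Each recursive call shortens the label path by one label, so the recursion depth is bounded by the length of $lp$.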

Lemma 8. Let $pe$ be a path expression and $lp$ be a label path. $SOP(pe)_{lp} = 
\{pe' \mid pe'$ a prefix of $pe$ and $L(pe'/lp) \subseteq L(pe)\} \cup \{pe'//* \mid pe'$ a prefix of $pe$ 
and $L(pe'//*/lp) \subseteq L(pe)\}$. 

Lemma 9. Let $pe$ be a path expression of length $n_{pe}$ and $lp$ be a label path. 
$SOP(pe)_{lp}$ is uniquely defined, finite and is computable in 0{n^ e .{n pe + ni p ))- 
time and $O(\log(n_{pe} + n_{lp}))$-space. 

Proof. From Lemma 8 we know that we have to calculate the two sets: $\{pe' \mid pe'$ 
a prefix of $pe$ and $L(pe'/lp) \subseteq L(pe)\}$ and $\{pe'//* \mid pe'$ a prefix of $pe$ and 
$L(pe'//*/lp) \subseteq L(pe)\}$. Hence $SOP(pe)_{lp}$ can be calculated in $O(n_{pe})$-time 
[15] and in $O(n_{pe})$-space. □ 

Lemma 10. Let $pe$ be a path expression and $lp_1$ and $lp_2$ be two label paths. 
$L(SOP(pe)_{lp_1}) \subseteq L(SOP(pe)_{lp_2})$ iff $\forall pe_a \in SOP(pe)_{lp_1}\ (L(pe_a/lp_2) \subseteq L(pe))$. 

Proof. From Definition 8 and Lemma 7. □ 

Theorem 4. Let $pe$ be a path expression and $lp_1$ and $lp_2$ be two label paths. 
It is decidable in $O(n_{pe} \cdot (n_{pe} + n_{lp}))$-time and in $O(n_{pe} + \log(n_{pe} + n_{lp}))$-space 
whether $L(SOP(pe)_{lp_1}) = L(SOP(pe)_{lp_2})$. 

Proof. From Lemmas 9 and 10. 

□ 
5.2 PQRN - Potential Query Result Nodes 

The main concept that is introduced in this subsection is the set of Potential 
Query Result Nodes (PQRN) for a query $Q$ in a schedule $S$. This set will 
contain all nodes $n$ that are added or deleted in $S$, and for which there exists a 
document tree $T$, such that $n$ is in the result of the query $Q$ when $S$ is applied 
to $T$⁶. For this purpose, we need to introduce some additional notations to 
characterize the trees on which a query $Q$ in a schedule $S$ will be executed. We 
will use these notations later on, and we also give some complexity results for 
calculating the value of these concepts. 

Let S be a consistent schedule that contains the query Q = query (n,pe). 

• We denote by $S^Q$ the actions of $S$ that occur before $Q$; $S^Q$ is called a 
subschedule of $S$; 

• Let $T$ be a basic input tree of $S$. We define $T^Q = S^Q[T]$ as the DT on 
which $Q$ in $S$ is evaluated; hence the result of the application of the query 
$Q$ in $S$ is $Q[T^Q]$; 

• We denote by $E^{min}(S^Q)$ the set that contains exactly those edges 
that are required in $T^Q$; this set is equal to $(E_I^{min}(S) - DEL(S^Q)) \cup 
ADD(S^Q)$ (Lemma EJ); 

• We denote by $E^{max}(S^Q)$ the set that contains exactly those edges that 
are allowed in $T^Q$; this set is equal to $(E_I^{max}(S) - DEL(S^Q)) \cup ADD(S^Q)$ 
(Lemma EJ). 

$E^{min}(S^Q)$ is a forest (Property HJ). As such every node $m$ of $E^{min}(S^Q)$ has a 
unique ancestor without a parent in $E^{min}(S^Q)$; it is denoted by $ARoot(S^Q, m)$. 
The label of the path from $ARoot(S^Q, m)$ to $m$ in $E^{min}(S^Q)$ is denoted by 
$ALabel(S^Q, m)$. 

Lemma 11. $ALabel(S^Q, m)$ and $ARoot(S^Q, m)$ can be computed in 0(ng)-time 
and $O(n_S)$-space. 

Proof. A consequence of Lemma ^ 

□ 

If $\mathrm{add}(m, l, n)$ or $\mathrm{del}(m, l, n)$ are operations of $S$ we say that $n$ is a 
non-building-node of $S$. Otherwise $n$ is called a building-node of $S$. Note that 
$E^{max}(S^Q) = E^{min}(S^Q) \cup \{$edges that contain only building nodes$\}$ since 

$E^{min}(S^Q) = (E_I^{min}(S) - DEL(S^Q)) \cup ADD(S^Q)$, 

$E^{max}(S^Q) = (E_I^{max}(S) - DEL(S^Q)) \cup ADD(S^Q)$ and 

$E_I^{max}(S) = E_I^{min}(S) \cup \{$edges that contain only building nodes$\}$. 

We will now define the set of nodes $PQRN(S, Q)$. This set will contain all 
non-building-nodes that can be in the result of a query that starts with a node 
$n$ that is not in $E^{min}(S^Q)$. After the formal definition we will show that this 
definition corresponds to this informal description. Finally we will show that 
this set is computable in polynomial time and space. 

6 This notion is only defined for a subset of queries, which will be specified later on. 
Definition 9. Let $S$ be a consistent schedule that contains a query $Q = \mathrm{query}(n, pe)$. 
We define the set $PQRN(S, Q)$ as: 
$PQRN(S, Q) = \{m \mid$ 

• $m$ a node in the graph $E^{min}(S^Q)$; 

• $m$ a non-building-node; 

• $ARoot(S^Q, m)$ a building-node; 

• $ARoot(S^Q, m) \neq n$; 

• $L(SOP(pe)_{ALabel(S^Q, m)}) \neq \emptyset$ 

$\}$. 

Lemma 12. Let $S$ be a consistent schedule, $Q = \mathrm{query}(n, pe)$ a query that 
appears in $S$, and $n$ a node that is not in the graph $E^{min}(S^Q)$. Then $PQRN(S, Q)$ 
is the set of non-building-nodes $m$, such that there exists a basic input tree $T$ of 
$S$ for which $m$ is in the result of the query $Q$ on the document tree $S^Q[T]$. 

Lemma 13. $PQRN(S, Q)$ can be computed in 0(ng)-time and $O(n_S)$-space. 

Proof. From Theorem 4 and Lemma 11. □ 
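
Definition 9 can be turned into a short procedure. The following Python sketch rests on assumptions that go beyond the text: actions are encoded as ((operation, arguments...), transaction), the set $E_I^{min}(S)$ is passed in by the caller because its definition belongs to the previous section, DEL and ADD are read as the plain sets of edges deleted and added before $Q$, and `label_suffix` is a stand-in for the SOP non-emptiness test that is only valid for path expressions without `*` or `//`. All function names are illustrative.

```python
def e_min_before(e_min_input, schedule, query):
    """(E_I^min(S) - DEL(S^Q)) | ADD(S^Q), over the actions that precede the query."""
    before = schedule[:schedule.index(query)]
    deleted = {op[1:] for op, _t in before if op[0] == "del"}
    added = {op[1:] for op, _t in before if op[0] == "add"}
    return (set(e_min_input) - deleted) | added

def label_suffix(pe, lp):
    """SOP(pe)_lp is non-empty, for a pe that is a plain tuple of labels."""
    return len(lp) < len(pe) and pe[len(pe) - len(lp):] == lp

def pqrn(e_min_input, schedule, query, has_prefix=label_suffix):
    (_op, start, pe), _t = query
    forest = e_min_before(e_min_input, schedule, query)
    parent = {c: (p, l) for p, l, c in forest}          # child -> (parent, edge label)
    non_building = {op[3] for op, _t in schedule if op[0] in ("add", "del")}
    result = set()
    # A node without a parent is its own ARoot and can never qualify,
    # so only nodes that have a parent in the forest are inspected.
    for m in parent:
        root, labels = m, []
        while root in parent:                            # climb to ARoot, collecting ALabel
            root, label = parent[root]
            labels.append(label)
        a_label = tuple(reversed(labels))
        if (m in non_building and root not in non_building
                and root != start and has_prefix(pe, a_label)):
            result.add(m)
    return result
```

For schedules whose queries use `*` or `//`, the `has_prefix` argument would have to be replaced by a test based on Definition 8.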

5.3 Decidability of Equivalence 

We will now establish the main result of this paper by proving that the equiva- 
lence of two schedules is decidable in our framework. 

Lemma 14. Given two consistent schedules $S_1$ and $S_2$ over the same set of 
transactions and whose QL schedules are equivalent. Let $Q = \mathrm{query}(n, pe)$ be 
a query in these schedules and let $n_S$ be the total number of actions in $S_1$ and 
$S_2$. It is decidable in 0(n G s )-time and $O(n_S)$-space whether $Q$ gives the same 
answer in $S_1$ as in $S_2$ for every possible basic input tree of $S_1$ and $S_2$. 

Proof. The next condition $CND(S_1, S_2, Q)$ detects when $Q$ gives the same 
answer in $S_1$ as in $S_2$ for every possible basic input tree of $S_1$ and $S_2$: 
Definition of $CND(S_1, S_2, Q)$ 

1. $\{m \mid$ there is a path of $L(pe)$ from $n$ to $m$ in $E^{min}(S_1^Q)\} = \{m \mid$ there 
is a path of $L(pe)$ from $n$ to $m$ in $E^{min}(S_2^Q)\}$; this can be done in 0(n|) 
time; this is a consequence of a result in [18]; 

2. furthermore, if $n$ is a building-node of $S_1$: 

(a) $PQRN(S_1, Q) = PQRN(S_2, Q)$; 

(b) for the nodes $m \in PQRN(S_1, Q)$ it holds that 

i. $ARoot(S_1^Q, m) = ARoot(S_2^Q, m)$; 

ii. $L(SOP(pe)_{ALabel(S_1^Q, m)}) = L(SOP(pe)_{ALabel(S_2^Q, m)})$. 

All this can be computed in 0(ri|)-time and in $O(n_S)$-space. 

□ 

The definition of the CND condition is illustrated in the following example. 

Example 5. In Example 3 we have 

• $E^{min}(S_1^Q) = \{(n_1, l_3, n_2), (n_2, l_2, n_3)\}$ and $E^{min}(S_2^Q) = \{(n_1, l_3, n_2)\}$; 1. 
is fulfilled; $n_1$ is a building-node; $n_2$ and $n_3$ are non-building-nodes; 
$PQRN(S_1, Q) = PQRN(S_2, Q) = \emptyset$; hence $CND(S_1, S_2, Q)$ is fulfilled and $Q$ 
gives the same answer in $S_1$ as in $S_2$ for every possible basic input tree of 
$S_1$ and $S_2$. 

• $E^{min}(S_3^Q) = \{(n_2, l_2, n_3)\}$ and $E^{min}(S_4^Q) = \emptyset$; 1. is fulfilled; $n_2$ is a 
building-node; $n_3$ is a non-building-node; $PQRN(S_3, Q) = \{n_3\}$ and 
$PQRN(S_4, Q) = \emptyset$; hence 2.(a) is not fulfilled and $Q$ does not give the same 
answer in $S_3$ as in $S_4$ for every possible basic input tree of $S_3$ and $S_4$. 

• $E^{min}(S_5^Q) = \{(n_1, l_1, n_2), (n_2, l_2, n_3)\}$ and $E^{min}(S_6^Q) = \{(n_1, l_1, n_2)\}$; hence 
$S_5$ and $S_6$ are not equivalent, since 1. is not fulfilled and $Q$ does not give 
the same answer in $S_5$ as in $S_6$ for every possible basic input tree of $S_5$ 
and $S_6$. 
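
The first part of the CND condition compares the nodes reachable from $n$ by a path of $L(pe)$ in two edge sets. A minimal Python sketch of that comparison follows. It assumes that a path expression is a tuple of steps (axis, test) with axis `/` (child) or `//` (descendant at depth one or more) and test a label or `*`; the function name `evaluate` is illustrative and the evaluation strategy is a naive traversal, not the algorithm behind the cited complexity bound.

```python
def evaluate(edges, start, pe):
    """Nodes reachable from `start` by a label path of L(pe) in the edge set."""
    children = {}
    for p, l, c in edges:
        children.setdefault(p, []).append((l, c))
    frontier = {start}
    for axis, test in pe:
        if axis == "//":
            # Extend the frontier with all descendants before taking the last step.
            stack = list(frontier)
            while stack:
                for _, c in children.get(stack.pop(), []):
                    if c not in frontier:
                        frontier.add(c)
                        stack.append(c)
        frontier = {c for p in frontier
                    for l, c in children.get(p, []) if test in ("*", l)}
    return frontier

PE = (("/", "l1"), ("/", "l2"))                      # the query l1/l2 of Example 3
E1 = {("n1", "l3", "n2"), ("n2", "l2", "n3")}        # E^min(S_1^Q)
E2 = {("n1", "l3", "n2")}                            # E^min(S_2^Q)
E5 = {("n1", "l1", "n2"), ("n2", "l2", "n3")}        # E^min(S_5^Q)
E6 = {("n1", "l1", "n2")}                            # E^min(S_6^Q)

condition_1_for_s1_s2 = evaluate(E1, "n1", PE) == evaluate(E2, "n1", PE)
condition_1_for_s5_s6 = evaluate(E5, "n1", PE) == evaluate(E6, "n1", PE)
```

On the edge sets of Example 5 the first comparison holds and the second does not, in line with the discussion of $S_1$, $S_2$ and $S_5$, $S_6$.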

Theorem 5. Given two consistent schedules $S_1$ and $S_2$ over the same set of 
transactions and whose QL schedules are equivalent. It is decidable in 0(n & s )- 
time and $O(n_S)$-space whether they are equivalent. 

Proof. Consequence of Lemma 14. □ 

Finally, we can now combine the previous theorems to show that serializ- 
ability is decidable in our framework. 

Theorem 6. Given a consistent schedule S. It is decidable in 0(n"' .n|) -time 
and 0(ng)-space whether S is serializable. 

Proof. From Theorem [3] and Theorem [5] □ 

6 Conclusion and Future Work 

In this paper we have presented a concurrency control mechanism for 
semistructured databases. This mechanism is document-independent in the sense that 
two schedules of semistructured transactions are equivalent iff they are equiva- 
lent on all possible documents. This notion of equivalence is a special form of 
view equivalence. The transactions that we consider, consist of simple updates 
(inserting and deleting edges at the bottom of a tree) and queries (simple path 
expressions containing child and descendant steps). We have shown that equiv- 
alence of schedules can be decided efficiently (i.e., in polynomial time in the size 
of the schedule), and that the serializability can be decided in time polynomial 
in the size of the schedule and exponential in the number of transactions. 
Improving this complexity result is expected to be difficult, since it is generally 
known that deciding view serializability is NP-complete [21]. 

In future work, we will extend the results of this paper by defining the 
behaviour of currently undefined actions, and hence allowing more schedules to 
be serialized. For example, the addition of an edge which is already in the input 
tree is undefined in our current work, and hence the operation fails. However, 
we could also say that as a result of this addition, we obtain an output tree 
which is equal to the input tree, and a message which indicates that the edge 
was already present. In this approach the result of a schedule applied on a 
document tree would be an annotated version of the schedule and an output 
document tree. A schedule would then be serializable iff there exists a serial 
schedule with the same operations, which has, for each document, the same 
output document tree and the same message for each operation. 